Information Extraction using Non-consecutive Word Sequences

Authors

  • Sachindra Joshi
  • Ganesh Ramakrishnan
  • Sreeram Balakrishnan
  • Ashwin Srinivasan
Abstract

We address an important deficiency in existing machine learning approaches for information extraction from natural language texts. Existing techniques for information extraction employ rules that exploit properties of consecutive word sequences. We argue that sequences of non-consecutive words capturing long range contextual correlations are vital features for information extraction from natural language text. We propose an efficient method that extends the a-priori algorithm to mine frequently occurring non-consecutive word sequences from a given corpus. We also perform a simplistic aggregation of feature information across multiple mentions of an entity in a document to avoid independent classification of the multiple occurrences of the entity. Experiments on some standard data sets show substantial improvements over previously reported
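
The mining step described in the abstract can be illustrated with a minimal level-wise sketch in the spirit of the a-priori algorithm. The abstract does not give the authors' actual procedure, so everything below is an assumption made for illustration: the function names (`is_subsequence`, `mine_frequent_sequences`), the use of sentence-level support counts, the `min_support` and `max_len` parameters, and the toy corpus are all hypothetical. The sketch relies on one property: if an ordered word sequence with gaps is frequent, then the sequences obtained by dropping its first or its last word are frequent too, so longer candidates are built only from shorter frequent ones.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple

Pattern = Tuple[str, ...]


def is_subsequence(pattern: Pattern, tokens: Sequence[str]) -> bool:
    """Return True if the words of `pattern` occur in `tokens` in order.

    Arbitrary gaps are allowed between matched words, which is what makes
    the pattern a non-consecutive word sequence.
    """
    it = iter(tokens)
    # `word in it` advances the iterator, so each match must come after the previous one.
    return all(word in it for word in pattern)


def mine_frequent_sequences(
    sentences: List[List[str]], min_support: int, max_len: int = 3
) -> Dict[Pattern, int]:
    """Level-wise (a-priori style) mining of frequent gapped word sequences.

    Support of a pattern = number of sentences that contain it as a subsequence.
    This is an illustrative sketch, not the method from the paper.
    """
    # Level 1: count each word once per sentence.
    counts: Counter = Counter()
    for sentence in sentences:
        for word in set(sentence):
            counts[(word,)] += 1
    frequent = {p: c for p, c in counts.items() if c >= min_support}
    result = dict(frequent)

    length = 1
    while frequent and length < max_len:
        # Join step: extend p with the last word of q when p minus its first
        # word equals q minus its last word (both parts are known to be frequent).
        candidates = {
            p + (q[-1],) for p in frequent for q in frequent if p[1:] == q[:-1]
        }
        # Count step: one pass over the corpus per level.
        counts = Counter()
        for sentence in sentences:
            for candidate in candidates:
                if is_subsequence(candidate, sentence):
                    counts[candidate] += 1
        frequent = {p: c for p, c in counts.items() if c >= min_support}
        result.update(frequent)
        length += 1
    return result


# Toy corpus (hypothetical), already tokenized on whitespace.
sentences = [
    "the meeting will be held in room 101".split(),
    "the seminar will be held tomorrow in hall 2".split(),
    "a talk was held in the auditorium".split(),
]
patterns = mine_frequent_sequences(sentences, min_support=2, max_len=3)
```

On this toy corpus the pattern `("held", "in")` is found in all three sentences, including the second one, where "tomorrow" separates the two words; a feature built only from consecutive words would miss that occurrence. A practical version would likely bound the allowed gap and prune candidates more aggressively, but those choices are outside what the abstract states.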

Similar resources

Integrating Word Sequences and Dependency Structures for Chemical-Disease Relation Extraction

Understanding chemical-disease relations (CDR) from biomedical literature is important for biomedical research and chemical discovery. This paper uses a k-max pooling convolutional neural network (CNN) to exploit word sequences and dependency structures for CDR extraction. Furthermore, an effective weighted context method is proposed to capture semantic information of word sequences. Our system...

A Parallel Multikey Quicksort Algorithm for Mining Multiword Units

In the context of word associations, multiword units (sequences of words that co-occur more often than expected by chance) are frequently used in everyday language, usually to precisely express ideas and concepts that cannot be compressed into a single word. For instance, [Bill of Rights], [swimming pool], [as well as], [in order to], [to comply with] or [to put forward] are multiword units. As...

EXTRACTION-BASED TEXT SUMMARIZATION USING FUZZY ANALYSIS

Due to the explosive growth of the world-wide web, automatic text summarization has become an essential tool for web users. In this paper we present a novel approach for creating text summaries. Using fuzzy logic and word-net, our model extracts the most relevant sentences from an original document. The approach utilizes fuzzy measures and inference on the extracted textual information from the docu...

Keyphrase Extraction and Grouping Based on Association Rules

Keyphrases are important in capturing the content of a document and thus useful for many natural language processing tasks such as Information Retrieval, Document Classification, and Text Summarization. Keyphrase extraction aims to identify multi-word sequences from a collection of documents that more or less correspond to keyphrases. In this paper, we propose a new method for keyphrase extract...

Automatic Extraction of Word Sequence Correspondences in Parallel Corpora

This paper proposes a method of finding correspondences of arbitrary length word sequences in aligned parallel corpora of Japanese and English. Translation candidates of word sequences are evaluated by a similarity measure between the sequences defined by the co-occurrence frequency and independent frequency of the word sequences. The similarity measure is an extension of Dice coefficient. An i...

Publication date: 2006